# Georeferenced Interview Transcriptions from the Nakba Archive

## Contributors & Citation

Annie K. Lamar*, Rick Castle**, Carissa Chappell***, Emmanouela Schoinoplokaki**, Allene Seet**

*Assistant Professor of Computational Classics, UC Santa Barbara, 
**PhD Candidate, Department of Classics, UC Santa Barbara, 
***PhD Student, Department of Classics, UC Santa Barbara

All contributors are members of the Low-Resource Language (LOREL) Lab at the University of California, Santa Barbara. All authors contributed equally.

### Recommended Citation

Castle, Rick; Chappell, Carissa; Schoinoplokaki, Emmanouela; Seet, Allene; Lamar, Annie K., 2024, "Georeferenced Interview Transcriptions from the Nakba Archive", Harvard Dataverse, V1. 

## Introduction

This dataset presents manually georeferenced interview transcripts from the Nakba Archive [1]. The Nakba Archive is a collection of 500 video interviews that documents the stories of Palestinians forcibly displaced during the *Nakba*, or catastrophe, of 1948. Of these 500 video interviews, thirty have been transcribed and translated into English. Our manual structuring, annotation, and georeferencing of these thirty transcripts allow the stories to be used by other computational researchers in pursuit of a peaceful world.


## Data Transcription and Structuring

The Nakba Archive [1] has translated (from Arabic to English) and transcribed thirty oral interviews. These English-language interviews are available as PDFs, and the Nakba Archive webpages contain metadata about each interview and participant (see below). We preserved the PDF documents during data structuring for data stability. We transformed each transcript into a data object, separating each interviewer prompt from the corresponding interviewee response, and manually annotated footnotes and geographic locations.

## Georeferencing Process

Since out-of-the-box NER models have a high error rate on many smaller places in the eastern Mediterranean and struggle to properly differentiate names that might refer to multiple geographic locations, we manually annotated all text in the thirty transcribed interviews from the Nakba Archive. After tagging all place names in the text, we created a dataset of all places and performed manual georeferencing. 

Georeferencing consisted of adding the latitude and longitude of each location mentioned in the transcripts to the dataset `nakba_archive_places.csv` under `latitude` and `longitude`. We obtained coordinates from Wikidata where available and from other sources otherwise. If the coordinates were available through Wikidata or Wikipedia, the link is included in the dataset; if not, the `data_notes` variable explains where we obtained the geographic information.
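
Because the coordinates were entered by hand, users may want to check them before mapping. The following is a minimal sketch of such a check, assuming the column names listed under Files and Variables. The sample rows and the helper `find_invalid_coordinates` are hypothetical placeholders for illustration, not values or code from the dataset; to run it on the real data, read `nakba_archive_places.csv` in place of the sample string.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE_PLACES_CSV = """place_ID,place_name,latitude,longitude,data_notes
1,Example Town,32.5,35.0,
2,Example Feature,,,Coordinates not found
3,Example Typo,132.5,35.0,
"""


def find_invalid_coordinates(rows):
    """Return the place_ID of every row whose coordinates are missing,
    non-numeric, or outside the valid latitude/longitude ranges."""
    invalid = []
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (ValueError, TypeError):
            # Empty or non-numeric coordinate cell.
            invalid.append(row["place_ID"])
            continue
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            invalid.append(row["place_ID"])
    return invalid


rows = list(csv.DictReader(io.StringIO(SAMPLE_PLACES_CSV)))
invalid_ids = find_invalid_coordinates(rows)
print(invalid_ids)
```

Rows flagged this way are not necessarily errors; a place may simply lack coordinates, in which case `data_notes` is the place to look for an explanation.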

## Files and Variables

Our dataset is divided into four files:
1. **nakba_archive_places.csv**: Contains a complete list of all places in the dataset, including geographic coordinates and place-specific metadata.
2. **nakba_archive_interviews.csv**: Contains the text of all the transcribed English-language interviews from the Nakba Archive.
3. **nakba_archive_footnotes.csv**: Contains the labeled text of the footnotes provided in the transcriptions of the interviews.
4. **nakba_archive_metadata.csv**: Contains information about the interviews and participants. 

### Places
For each place, we provide the following information:
- place_ID: Unique identifier for this place.
- place_name: Name of the place.
- alt_names: Other names for the place that are either common or used in the text of the transcribed interviews.
- latitude: Geographic latitude of place.
- longitude: Geographic longitude of place.
- place_type: Reflects the language used to describe the place in its Wikipedia listing, if available; otherwise, reflects the best judgement of the contributors. Options are Continent, Region, Country, City, Town, Village (current), Village (-1948), Moshav, Camp, Neighborhood, or Feature. "Feature" includes all specific, named locations such as schools, shops, bridges, etc. Village (-1948) describes a village that was destroyed in the violence of 1948.
- Wiki_link: A stable link to the Wikipedia listing for the place, if available.
- data_notes: Any notes about interpretations or assumptions made in georeferencing this place.
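
As an illustration of how `place_type` can be used, the sketch below counts places by type and selects the villages destroyed in 1948. It assumes the `place_type` values are spelled exactly as listed above; the sample rows and the helper `places_of_type` are hypothetical and do not come from the dataset.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE_PLACES_CSV = """place_ID,place_name,alt_names,latitude,longitude,place_type,Wiki_link,data_notes
1,Example City,,32.0,35.0,City,,
2,Example Village A,,32.1,35.1,Village (-1948),,
3,Example Village B,,32.2,35.2,Village (-1948),,
4,Example School,,32.3,35.3,Feature,,
"""


def places_of_type(rows, place_type):
    """Return the rows whose place_type matches exactly."""
    return [row for row in rows if row["place_type"] == place_type]


rows = list(csv.DictReader(io.StringIO(SAMPLE_PLACES_CSV)))

# How many places of each type appear in the file.
type_counts = Counter(row["place_type"] for row in rows)

# Villages destroyed in 1948, per the place_type definition above.
destroyed_villages = places_of_type(rows, "Village (-1948)")
print(type_counts)
print([row["place_name"] for row in destroyed_villages])
```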

### Interviews
For each interview, we provide the following:
- interview_ID: Identifier for the interview to which this text belongs.
- response_ID: Unique identifier for this response *within this interview*. 
- prompt: Prompt given by interviewer.
- response: Response given by interviewee.
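
Since `response_ID` is unique only within an interview, a row is identified by the pair (`interview_ID`, `response_ID`). The sketch below groups rows into per-interview transcripts and rejects duplicate pairs. It assumes rows appear in the file in conversational order, which the description above does not state; the sample rows and the helper `group_by_interview` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE_INTERVIEWS_CSV = """interview_ID,response_ID,prompt,response
1,1,Placeholder question A,Placeholder answer A
1,2,Placeholder question B,Placeholder answer B
2,1,Placeholder question C,Placeholder answer C
"""


def group_by_interview(rows):
    """Group (prompt, response) pairs by interview_ID, keeping file order.

    Raises ValueError if an (interview_ID, response_ID) pair repeats,
    since that pair is expected to identify a row uniquely.
    """
    transcripts = defaultdict(list)
    seen = set()
    for row in rows:
        key = (row["interview_ID"], row["response_ID"])
        if key in seen:
            raise ValueError(f"duplicate (interview_ID, response_ID): {key}")
        seen.add(key)
        transcripts[row["interview_ID"]].append((row["prompt"], row["response"]))
    return dict(transcripts)


rows = list(csv.DictReader(io.StringIO(SAMPLE_INTERVIEWS_CSV)))
transcripts = group_by_interview(rows)
for prompt, response in transcripts["1"]:
    print(f"Q: {prompt}\nA: {response}")
```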

### Footnotes
For all footnotes, we provide the following:
- interview_ID: Identifier for interview to which this footnote belongs.
- footnote_num: Number of the footnote within the interview; one-indexed.
- footnote_text: Text of footnote.
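
To look up a footnote by its number, one option is a dictionary keyed by (`interview_ID`, `footnote_num`). This is a minimal sketch that assumes `footnote_num` is stored as an integer string and restarts at 1 in each interview; the sample rows and the helper `build_footnote_lookup` are hypothetical.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE_FOOTNOTES_CSV = """interview_ID,footnote_num,footnote_text
1,1,Placeholder footnote one
1,2,Placeholder footnote two
2,1,Placeholder footnote for another interview
"""


def build_footnote_lookup(rows):
    """Map (interview_ID, footnote_num) to footnote_text.

    footnote_num is converted to int and must be at least 1 (one-indexed).
    """
    lookup = {}
    for row in rows:
        number = int(row["footnote_num"])
        if number < 1:
            raise ValueError(f"footnote_num must be one-indexed, got {number}")
        lookup[(row["interview_ID"], number)] = row["footnote_text"]
    return lookup


rows = list(csv.DictReader(io.StringIO(SAMPLE_FOOTNOTES_CSV)))
footnotes = build_footnote_lookup(rows)
print(footnotes[("1", 2)])
```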

### Metadata
We provide the following metadata:
- interview_ID: Unique identifier for this interview.
- interviewer_name: Name of the interviewer.
- interviewee_name: Name of the person being interviewed.
- interviewee_alt_name: Alternative name for person being interviewed.
- interviewer_gender: Gender of the interviewer.
- interviewee_gender: Gender of the interviewee.
- interview_location: Location of the interview (included in nakba_archive_places.csv).
- interview_location_ID: Unique identifier for the location as marked in the places dataset.
- interview_month: Month the interview took place.
- interview_day: Day the interview took place.
- interview_year: Year the interview took place.
- Birthyear: Birth year of the interviewee.
- Birthplace: Birthplace of the interviewee.
- birthplace_location_ID: Unique identifier for the birthplace as marked in the places dataset.
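
The location identifiers make it possible to link the metadata to the places file. The sketch below resolves each interviewee's birthplace to coordinates, assuming that `birthplace_location_ID` matches `place_ID` exactly and that an empty identifier means the birthplace is not in the places file. Both assumptions, the sample rows, and the helper `birthplace_coordinates` are hypothetical.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE_PLACES_CSV = """place_ID,place_name,latitude,longitude
10,Example Village,32.1,35.1
11,Example Camp,33.5,35.4
"""

SAMPLE_METADATA_CSV = """interview_ID,interviewee_name,interview_location_ID,Birthyear,Birthplace,birthplace_location_ID
1,Placeholder Name A,11,1930,Example Village,10
2,Placeholder Name B,11,1935,Unlisted Place,
"""


def birthplace_coordinates(metadata_rows, place_rows):
    """Map interview_ID to (latitude, longitude) of the birthplace,
    or to None when the birthplace ID has no match in the places rows."""
    places = {row["place_ID"]: row for row in place_rows}
    coordinates = {}
    for row in metadata_rows:
        place = places.get(row["birthplace_location_ID"])
        if place is None:
            coordinates[row["interview_ID"]] = None
        else:
            coordinates[row["interview_ID"]] = (
                float(place["latitude"]),
                float(place["longitude"]),
            )
    return coordinates


place_rows = list(csv.DictReader(io.StringIO(SAMPLE_PLACES_CSV)))
metadata_rows = list(csv.DictReader(io.StringIO(SAMPLE_METADATA_CSV)))
coords = birthplace_coordinates(metadata_rows, place_rows)
print(coords)
```

The same lookup works for `interview_location_ID` by changing the column name used in `places.get(...)`.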

## References

[1] Nakba Archive. nakba-archive.org.